Javier Orman
LinkedIn: https://www.linkedin.com/in/javierorman/
GitHub: https://github.com/javierorman
In this notebook, I analyze data on Chicago Public Schools. I obtained the data by scraping a few websites containing survey and income information. The goal is to be able to answer the following questions:
I gathered the data using web scrapers, which are available in this project's GitHub repository.
The school_data information comes from https://5-essentials.org/cps/5e/2018/ and it contains 5 different ratings per school (values 0-100): ambitious instruction, collaborative teachers, effective leaders, involved families and supportive environment. The scraper also obtained the name of each school and full address.
The incomes data was scraped from this website: https://www.zipdatamaps.com/zipcodes-chicago-il. It contains the median household income for each zip code.
# Data Analysis
import numpy as np
import pandas as pd
# Visualization
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from matplotlib import animation
%matplotlib inline
import seaborn as sns
import plotly.express as px
plt.style.use('seaborn')
plt.rcParams['figure.dpi'] = 100
viz_config = {'color': 'c', 'alpha': 0.4}
sns.set_context('paper')
# Machine Learning
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from sklearn.cluster import MeanShift
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import estimate_bandwidth
from sklearn.cluster import AffinityPropagation
from sklearn.mixture import GaussianMixture
from sklearn.mixture import BayesianGaussianMixture
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import silhouette_samples, silhouette_score
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.manifold import TSNE
random_state=42
school_data = pd.read_csv('raw_data/schools.csv', header=0, index_col=0)
income_data = pd.read_csv('raw_data/incomes.csv', header=0, names=['zip', 'income'])
school_data.head()
| | Address | Ambitious_Instruction | Collaborative_Teachers | Effective_Leaders | Involved_Families | Supportive_Environment |
|---|---|---|---|---|---|---|
| A.N. Pritzker School | 2009 W Schiller St, Chicago, IL 60622 | 63 | 46 | 42 | 70 | 45 |
| ACE Technical Charter School | 5410 S State St, Chicago, IL 60609 | 79 | 71 | 65 | 70 | 46 |
| ASPIRA Business and Finance | 2989 N Milwaukee Ave, Chicago, IL 60618 | 55 | 32 | 16 | 25 | 49 |
| ASPIRA Charter School - Early College High School | 3986 W Barry Ave, Chicago, IL 60618 | 72 | 64 | 52 | 75 | 64 |
| ASPIRA Charter School - Haugan Middle School | 3729 W Leland Ave, Chicago, IL 60625 | 99 | 87 | 80 | 80 | 60 |
There is a mistake in the data: the zip code for George Washington Carver Military Academy HS is supposed to be 60827, not 60627.
# Function to extract zip code out of an address
get_zip = lambda x: x.split()[-1]
# View the problematic entry
cond = school_data['Address'].apply(get_zip) == '60627'
school_data.loc[cond]
| | Address | Ambitious_Instruction | Collaborative_Teachers | Effective_Leaders | Involved_Families | Supportive_Environment |
|---|---|---|---|---|---|---|
| George Washington Carver Military Academy HS | 13100 S Doty Ave, Chicago, IL 60627 | 75 | 74 | 59 | 82 | 56 |
# Replace zip code with correct one: 60827
school_data.loc[cond, 'Address'] = '13100 S Doty Ave, Chicago, IL 60827'
# Check that the replacement worked
school_data.loc[cond]
| | Address | Ambitious_Instruction | Collaborative_Teachers | Effective_Leaders | Involved_Families | Supportive_Environment |
|---|---|---|---|---|---|---|
| George Washington Carver Military Academy HS | 13100 S Doty Ave, Chicago, IL 60827 | 75 | 74 | 59 | 82 | 56 |
school_data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 655 entries, A.N. Pritzker School to Young Women's Leadership Charter School
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   Address                 655 non-null    object
 1   Ambitious_Instruction   655 non-null    int64
 2   Collaborative_Teachers  655 non-null    int64
 3   Effective_Leaders       655 non-null    int64
 4   Involved_Families       655 non-null    int64
 5   Supportive_Environment  655 non-null    int64
dtypes: int64(5), object(1)
memory usage: 35.8+ KB
school_data.describe()
| | Ambitious_Instruction | Collaborative_Teachers | Effective_Leaders | Involved_Families | Supportive_Environment |
|---|---|---|---|---|---|
| count | 655.000000 | 655.000000 | 655.000000 | 655.000000 | 655.000000 |
| mean | 69.600000 | 59.981679 | 53.824427 | 61.665649 | 50.581679 |
| std | 23.629764 | 19.359696 | 17.555800 | 21.361594 | 21.584674 |
| min | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 |
| 25% | 61.500000 | 49.000000 | 44.000000 | 51.000000 | 39.500000 |
| 50% | 74.000000 | 62.000000 | 56.000000 | 63.000000 | 51.000000 |
| 75% | 86.000000 | 73.000000 | 66.000000 | 77.000000 | 63.000000 |
| max | 99.000000 | 99.000000 | 98.000000 | 99.000000 | 99.000000 |
The web-crawling 'spider' inserted -1 for missing values. Schools that didn't have numbers for any of the categories were skipped in the process.
# Replace -1 with NaN
school_data[school_data == -1] = np.nan
# Count NaNs by column
school_data.isna().sum()
Address                    0
Ambitious_Instruction     43
Collaborative_Teachers    14
Effective_Leaders         15
Involved_Families         24
Supportive_Environment    43
dtype: int64
# Count rows with too many NaNs
max_nans_allowed = 2
rows_too_many_nans = school_data.isna().sum(axis=1) > max_nans_allowed
len(school_data[rows_too_many_nans])
15
If we allow a maximum of 1 missing value per row, we lose 58 instances: almost 10% of our dataset. With max_nans_allowed set to 2, we only lose 15 instances.
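To pick the threshold, it helps to see how many rows each value of max_nans_allowed would discard. A minimal sketch on a toy frame (the column names here are illustrative, not from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame: the four rows carry 0, 1, 2 and 3 NaNs respectively
toy = pd.DataFrame({
    'a': [1.0, np.nan, np.nan, np.nan],
    'b': [2.0, 2.0, np.nan, np.nan],
    'c': [3.0, 3.0, 3.0, np.nan],
})

def rows_lost(df, max_nans_allowed):
    """Count rows with more than max_nans_allowed missing values."""
    return int((df.isna().sum(axis=1) > max_nans_allowed).sum())

print({k: rows_lost(toy, k) for k in range(3)})  # {0: 3, 1: 2, 2: 1}
```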
# New DataFrame without too many NaNs on each row
school_df = school_data[~rows_too_many_nans]
# Create separate Series for addresses
school_addresses = school_df.iloc[:, 0]
# Discard 'Address' column from DataFrame
school_df = school_df.iloc[:, 1:]
school_addresses
A.N. Pritzker School 2009 W Schiller St, Chicago, IL 60622
ACE Technical Charter School 5410 S State St, Chicago, IL 60609
ASPIRA Business and Finance 2989 N Milwaukee Ave, Chicago, IL 60618
ASPIRA Charter School - Early College High School 3986 W Barry Ave, Chicago, IL 60618
ASPIRA Charter School - Haugan Middle School 3729 W Leland Ave, Chicago, IL 60625
...
YCCS - Sullivan House Alternative HS 8164 S South Chicago Ave, Chicago, IL 60617
YCCS - Truman Middle College HS 1145 W Wilson Ave, Chicago, IL 60640
YCCS - West Town Acad Alternative HS 500 N Sacramento Blvd, Chicago, IL 60612
YCCS - Youth Connection Leadership Acad HS 3424 S State St, Chicago, IL 60616
Young Women's Leadership Charter School 2641 S Calumet Ave, Chicago, IL 60616
Name: Address, Length: 640, dtype: object
# Fill NaNs with their respective row's median
school_df = school_df.apply(lambda row: row.fillna(row.median()), axis=1)
school_df.describe()
| | Ambitious_Instruction | Collaborative_Teachers | Effective_Leaders | Involved_Families | Supportive_Environment |
|---|---|---|---|---|---|
| count | 640.000000 | 640.000000 | 640.000000 | 640.000000 | 640.000000 |
| mean | 74.139062 | 61.329687 | 55.126562 | 63.998438 | 54.900000 |
| std | 15.274362 | 17.325542 | 15.608410 | 17.861929 | 17.665099 |
| min | 16.000000 | 16.000000 | 4.000000 | 7.000000 | 12.000000 |
| 25% | 64.000000 | 50.000000 | 46.000000 | 52.000000 | 42.000000 |
| 50% | 75.000000 | 63.000000 | 56.000000 | 64.000000 | 53.000000 |
| 75% | 87.000000 | 74.000000 | 66.000000 | 78.000000 | 66.000000 |
| max | 99.000000 | 99.000000 | 98.000000 | 99.000000 | 99.000000 |
From these summary statistics and the following histograms, we see that there are no invalid values such as negatives, numbers over 100, or NaNs.
school_df.hist(figsize=(12,12), **viz_config);
income_data.head()
| | zip | income |
|---|---|---|
| 0 | 60007 | 75743.0 |
| 1 | 60018 | 60566.0 |
| 2 | 60106 | 67325.0 |
| 3 | 60131 | 57815.0 |
| 4 | 60176 | 53360.0 |
income_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68 entries, 0 to 67
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   zip     68 non-null     int64
 1   income  67 non-null     float64
dtypes: float64(1), int64(1)
memory usage: 1.2 KB
income_data.isna().sum()
zip       0
income    1
dtype: int64
# Find rows with income NaN
cond_isna = income_data.iloc[:, 1].isna()
income_data.loc[cond_isna]
| | zip | income |
|---|---|---|
| 44 | 60642 | NaN |
# Find rows with income 0
cond_zero = income_data.iloc[:, 1] == 0
income_data.loc[cond_zero]
| | zip | income |
|---|---|---|
| 54 | 60654 | 0.0 |
| 61 | 60666 | 0.0 |
# Calculate median for the income column
income_median = income_data.iloc[:,1].median()
# Assign income_median to the locations where conditions return True
income_data.iloc[(cond_isna | cond_zero).values, 1] = income_median
# Describe the income column
income_data.income.describe()
count        68.000000
mean      56778.794118
std       23573.079579
min       18464.000000
25%       42493.250000
50%       54344.000000
75%       63822.500000
max      171463.000000
Name: income, dtype: float64
income_data.income.hist(bins=20, **viz_config)
plt.xlim(0)
plt.show()
# Create zip column in school_df to use in the merge
school_df['zip'] = school_addresses.apply(lambda x: x.split()[-1])
school_df['zip'] = school_df['zip'].astype('int64')
school_df.head()
| | Ambitious_Instruction | Collaborative_Teachers | Effective_Leaders | Involved_Families | Supportive_Environment | zip |
|---|---|---|---|---|---|---|
| A.N. Pritzker School | 63.0 | 46.0 | 42.0 | 70.0 | 45.0 | 60622 |
| ACE Technical Charter School | 79.0 | 71.0 | 65.0 | 70.0 | 46.0 | 60609 |
| ASPIRA Business and Finance | 55.0 | 32.0 | 16.0 | 25.0 | 49.0 | 60618 |
| ASPIRA Charter School - Early College High School | 72.0 | 64.0 | 52.0 | 75.0 | 64.0 | 60618 |
| ASPIRA Charter School - Haugan Middle School | 99.0 | 87.0 | 80.0 | 80.0 | 60.0 | 60625 |
merged_df = pd.merge(school_df, income_data, how='left', on='zip')
merged_df.head()
| | Ambitious_Instruction | Collaborative_Teachers | Effective_Leaders | Involved_Families | Supportive_Environment | zip | income |
|---|---|---|---|---|---|---|---|
| 0 | 63.0 | 46.0 | 42.0 | 70.0 | 45.0 | 60622 | 54344.0 |
| 1 | 79.0 | 71.0 | 65.0 | 70.0 | 46.0 | 60609 | 31357.0 |
| 2 | 55.0 | 32.0 | 16.0 | 25.0 | 49.0 | 60618 | 55303.0 |
| 3 | 72.0 | 64.0 | 52.0 | 75.0 | 64.0 | 60618 | 55303.0 |
| 4 | 99.0 | 87.0 | 80.0 | 80.0 | 60.0 | 60625 | 53288.0 |
# Delete the zip column
merged_df.drop('zip', axis=1, inplace=True)
merged_df.describe()
| | Ambitious_Instruction | Collaborative_Teachers | Effective_Leaders | Involved_Families | Supportive_Environment | income |
|---|---|---|---|---|---|---|
| count | 640.000000 | 640.000000 | 640.000000 | 640.000000 | 640.000000 | 640.000000 |
| mean | 74.139062 | 61.329687 | 55.126562 | 63.998438 | 54.900000 | 45439.895312 |
| std | 15.274362 | 17.325542 | 15.608410 | 17.861929 | 17.665099 | 13279.991805 |
| min | 16.000000 | 16.000000 | 4.000000 | 7.000000 | 12.000000 | 18464.000000 |
| 25% | 64.000000 | 50.000000 | 46.000000 | 52.000000 | 42.000000 | 34670.000000 |
| 50% | 75.000000 | 63.000000 | 56.000000 | 64.000000 | 53.000000 | 45717.000000 |
| 75% | 87.000000 | 74.000000 | 66.000000 | 78.000000 | 66.000000 | 54344.000000 |
| max | 99.000000 | 99.000000 | 98.000000 | 99.000000 | 99.000000 | 93702.000000 |
All features except for income have similar ranges: 0 to 100 (or 99 if we want to be precise). In order for some of the clustering algorithms to properly detect patterns in the data, we need income to be in a similar range as the other features.
max_income = max(merged_df['income'])
merged_df['income'] = (merged_df['income'] / max_income) * 100
merged_df.describe()
| | Ambitious_Instruction | Collaborative_Teachers | Effective_Leaders | Involved_Families | Supportive_Environment | income |
|---|---|---|---|---|---|---|
| count | 640.000000 | 640.000000 | 640.000000 | 640.000000 | 640.000000 | 640.000000 |
| mean | 74.139062 | 61.329687 | 55.126562 | 63.998438 | 54.900000 | 48.494051 |
| std | 15.274362 | 17.325542 | 15.608410 | 17.861929 | 17.665099 | 14.172581 |
| min | 16.000000 | 16.000000 | 4.000000 | 7.000000 | 12.000000 | 19.705022 |
| 25% | 64.000000 | 50.000000 | 46.000000 | 52.000000 | 42.000000 | 37.000277 |
| 50% | 75.000000 | 63.000000 | 56.000000 | 64.000000 | 53.000000 | 48.789780 |
| 75% | 87.000000 | 74.000000 | 66.000000 | 78.000000 | 66.000000 | 57.996628 |
| max | 99.000000 | 99.000000 | 98.000000 | 99.000000 | 99.000000 | 100.000000 |
sns.pairplot(merged_df,
diag_kind='hist',
plot_kws=viz_config,
diag_kws=viz_config)
plt.xlim(0,100)
plt.ylim(0,100)
plt.show();
Although it's impossible to draw conclusions about the data's cluster structure from looking at one or two variables at a time, we can note that no clear separations are visible just yet.
get_fresh_X = lambda : merged_df.values
X = get_fresh_X()
def get_cluster_counts(model_class, param, values, additional_params={}):
'''
Given a model class name, a parameter name and a list of values,
get the resulting number of clusters for each of those values.
'''
cluster_counts = []
print(param + ':')
for value in values:
print(str(value) + ' ', end='')
param_dict = {param: value}
param_dict.update(additional_params)
model = model_class(**param_dict).fit(X)
cluster_count = len(model.cluster_centers_)
cluster_counts.append(cluster_count)
print('')
print('Cluster counts:')
print(cluster_counts)
return cluster_counts
def get_final_params(initial_param_values, cluster_counts, min_clusters=2, max_clusters=10):
'''
Given an initial list of parameter values and how many clusters were formed for each of them,
return a list of unique parameter values that caused cluster counts within the range determined by
min_clusters and max_clusters.
'''
range_clusters = range(min_clusters, max_clusters + 1)
final_values = []
for i in range_clusters:
# Find the index in param_values where cluster_counts equals i
idx = np.argmax(np.array(cluster_counts) == i)
if idx == 0: # np.argmax returns 0 when no cluster_counts equal i
continue
value = initial_param_values[idx] # Value of parameter at index idx in param_values
print('parameter value: ' + str(value), '--- ' + 'number of clusters: ' + str(i))
final_values.append(value)
final_values.sort()
return final_values
def scatter_labels(model_name, labels, centers=None, X=X, x=0, y=1, cmap=cm.Set2, ax=None, title=False):
'''
Make a scatter plot with colored clusters.
'''
# Prepare colors
n_clusters = len(set(labels))
colors = cmap(labels.astype(float) / n_clusters)
# Make figure
if not ax:
fig, ax = plt.subplots()
ax.scatter(X[:, x], X[:, y], c=colors, s=4)
ax.set_xlabel(merged_df.columns[x])
ax.set_ylabel(merged_df.columns[y])
ax.set_xticks([])
ax.set_yticks([])
ax.set_xlim([0,102])
ax.set_ylim([0,102])
if title:
ax.set_title(f'Model: {model_name}, Clusters: {n_clusters}', fontweight='bold')
else:
ax.set_title(f'Clusters: {n_clusters}')
if centers is not None:
# Draw white circles at cluster centers
ax.scatter(centers[:, x], centers[:, y], marker='o',
c="white", alpha=1, s=100, edgecolor='k')
for i, c in enumerate(centers):
ax.scatter(c[x], c[y], marker='$%d$' % i, alpha=1,
s=25, edgecolor='k')
def plot_params(model_class, param, param_values, additional_params={}, figsize=(12,12)):
'''
Create a grid of cluster-colored scatter plots corresponding to different parameter values for a model.
'''
col_tuples = [(0, 1), (2, 3), (4, 5)]
n_values = len(param_values)
fig, ax_rows = plt.subplots(3, n_values, figsize=figsize)
fig.suptitle(model_class.__name__, fontweight='bold')
for value_idx, value in enumerate(param_values):
# Build dictionary with parameters
param_dict = {param: value}
param_dict.update(additional_params)
# Fit the model
model = model_class(**param_dict).fit(X)
# Get labels
try:
labels = model.labels_
except AttributeError:
labels = model.predict(X)
# Get cluster centers
try:
centers = model.cluster_centers_
except AttributeError:
centers = None
# Get the model name
model_name = model.__class__.__name__
row = value_idx
for x, y in col_tuples:
plt.tight_layout()
col = x // 2
ax = ax_rows[col][row]
scatter_labels(model_name, labels=labels, centers=centers, x=x, y=y, ax=ax)
### BASED ON https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html###
def plot_silhouettes(X, n_clusters, labels, title='', metric='euclidean', low_x=-0.1, ax=None, figsize=(6, 4)):
'''
Plot silhouette diagrams.
'''
# Create and format figure and axes
if ax is None: # Usually if we're plotting an individual diagram
fig, ax = plt.subplots(figsize=figsize)
ax.set_title(title, fontsize=12, fontweight='bold')
ax.set_xlim([low_x, 1])
# Compute the silhouette score average for all samples
silhouette_avg = silhouette_score(X, labels, metric=metric)
# Compute the silhouette scores for each sample
sample_silhouette_values = silhouette_samples(X, labels, metric=metric)
# Separation between clusters' silhouette plots
y_lower = 10
for i in range(n_clusters):
# Aggregate the silhouette scores for samples belonging to
# cluster i, and sort them
ith_cluster_silhouette_values = \
sample_silhouette_values[labels == i]
ith_cluster_silhouette_values.sort()
size_cluster_i = ith_cluster_silhouette_values.shape[0]
y_upper = y_lower + size_cluster_i
color = cm.Set2(float(i) / n_clusters)
ax.fill_betweenx(np.arange(y_lower, y_upper),
0, ith_cluster_silhouette_values,
facecolor=color, edgecolor=color, alpha=0.7)
# Label the silhouette plots with their cluster numbers at the middle
ax.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
# Compute the new y_lower for next plot
y_lower = y_upper + 10 # 10 for the 0 samples
# The vertical line for average silhouette score of all the values
ax.axvline(x=silhouette_avg, color="red", linestyle="--")
ax.set_yticks([]) # Clear the yaxis labels / ticks
ax.set_xticks([0, silhouette_avg, 1])
Scikit-Learn offers implementations of a variety of clustering algorithms. Each of these has its own hyper-parameters that can be tweaked and its own 'preferences' as to what types of clusters it is able to distinguish. I will provide a short description of each algorithm.
We'll also evaluate how each algorithm performs under different hyper-parameter values. Because we're dealing with unsupervised learning, we do not have a 'ground truth' to which we can compare the cluster assignments. However, we can use the silhouette coefficient, which measures how close each instance is to the other instances in its own cluster, compared to the instances in the next closest cluster.
We can interpret the silhouette coefficient as follows: values close to +1 mean the instance sits well inside its own cluster, values near 0 mean it lies close to a cluster boundary, and negative values suggest it may have been assigned to the wrong cluster.
Silhouette coefficients can be visualized in a silhouette diagram, where samples are grouped by cluster and their silhouette coefficients are plotted horizontally. Ideally, we'd like all clusters to cross the average silhouette coefficient (the silhouette score) and contain as few negative values as possible.
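As a quick sanity check of how the score behaves, here is a toy example (the blob parameters are arbitrary): a good clustering of well-separated blobs should score much higher than a random labeling.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated blobs
X_toy, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)

# Silhouette score of an actual clustering vs. a random labeling
good_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_toy)
random_labels = np.random.default_rng(42).integers(0, 3, size=len(X_toy))

print(silhouette_score(X_toy, good_labels))    # close to 1
print(silhouette_score(X_toy, random_labels))  # close to 0
```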
In addition to silhouette diagrams, we can do something much less formal: visualize the cluster assignments on the features themselves. Because we cannot visualize 6 dimensions at a time, we'll just use 3 pairs of features and be content for now. These visualizations should not be used to draw conclusions but merely to partially observe the results of the algorithms.
Once we narrow down our options, we'll compute a 3-D projection of the dataset using t-SNE. Although we will lose all concept of units at that point, we'll be able to visualize the results in a more informative way.
X = get_fresh_X()
Probably the most popular of the bunch, K-Means groups data points using distances. The process amounts to these basic steps:
1. Pick k initial cluster centers (centroids).
2. Assign each instance to its closest centroid.
3. Move each centroid to the mean of the instances assigned to it.
4. Repeat steps 2 and 3 until the centroids stop moving.
In K-Means, we need to tell the algorithm how many clusters we'd like to see. We can evaluate different numbers k of clusters with silhouette diagrams.
This algorithm expects clusters to be of globular (spherical) shapes and similar sizes. It also assigns every instance to a cluster, including 'noise' points.
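A minimal NumPy sketch of the procedure KMeans implements (Lloyd's algorithm); the naive initialization below is an illustrative shortcut, not what sklearn does (sklearn defaults to k-means++):

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100):
    """Plain Lloyd's algorithm with a naive init spread across the dataset."""
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].astype(float)
    for _ in range(n_iter):
        # Assignment step: label each instance with its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its instances
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # converged: centers stopped moving
        centers = new_centers
    return centers, labels

# Two obvious groups, around (0, 0) and (10, 10)
rng = np.random.default_rng(42)
pts = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(10, 0.5, (20, 2))])
centers, labels = kmeans_sketch(pts, k=2)
```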
We start by exploring the algorithm with k=5 clusters:
k=5
kmeans = KMeans(n_clusters=k, random_state=random_state)
kmeans.fit(X)
KMeans(n_clusters=5, random_state=42)
# The location of all 5 cluster centers
kmeans.cluster_centers_
array([[85.08088235, 60.95588235, 54.83823529, 57.85294118, 65.69852941,
39.71649171],
[62.70700637, 52.39490446, 47.46496815, 55.19426752, 42.98726115,
49.73684021],
[91.05172414, 82.20689655, 72.99137931, 84.3362069 , 76.07758621,
48.32514707],
[72.42675159, 69.17834395, 62.07643312, 74.89171975, 49.82165605,
57.23448112],
[55.40540541, 31.59459459, 29.16216216, 38.97972973, 37.90540541,
43.70993264]])
# Clusters assigned to first 20 instances
kmeans.labels_[:20]
array([1, 3, 4, 3, 2, 3, 2, 1, 1, 1, 2, 4, 1, 1, 3, 4, 4, 3, 4, 1],
dtype=int32)
Compare KMeans for k values from 2 to 10:
kmeans_per_k = [KMeans(n_clusters=k, random_state=random_state).fit(X) for k in range(2,11)]
# Silhouette scores for all values of k
silhouette_scores = [silhouette_score(X, model.labels_) for model in kmeans_per_k]
plt.bar(range(2,11), silhouette_scores, **viz_config)
plt.xlabel('Number \'k\' of clusters')
plt.ylabel('Silhouette Score');
# Plot silhouette diagrams
range_n_clusters = range(2,10)
indices = np.array(range_n_clusters) - np.array(range_n_clusters)[0]
fig, axs = plt.subplots(2, 4, figsize=(12,8))
axs = axs.flatten()
fig.suptitle("Silhouette analysis for KMeans clustering",
fontsize=20, fontweight='bold')
fig.text(0.5, -0.004, "Silhouette coefficient values", ha="center", va="center", size=16)
fig.text(-0.005, 0.5, "Cluster label", ha="center", va="center", rotation=90, size=16)
for idx, n in zip(indices, range_n_clusters):
model = KMeans(n_clusters=n, random_state=random_state)
labels = model.fit_predict(X)
ax = axs[idx]
plot_silhouettes(X, n, labels, metric='sqeuclidean', ax=ax)
plt.tight_layout()
plt.show()
plot_params(KMeans, 'n_clusters', [2, 3, 4], {'random_state': random_state}, figsize=(12,12))
From sklearn's documentation: DBSCAN finds core samples of high density and expands clusters from them. Good for data which contains clusters of similar density.
The algorithm finds core instances by determining whether they have enough neighbors (min_samples) within a certain distance (epsilon). All instances within the epsilon-neighborhood of a core instance are assigned to the same cluster. An instance that is not in the neighborhood of any core instance is considered an anomaly.
This algorithm works with clusters of any size and shape. The challenge when using DBSCAN is tuning the hyperparameters eps and min_samples.
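A toy demonstration of the noise handling (the eps and min_samples values here are chosen for this synthetic data, not for ours): two dense groups become clusters, while the far-away point is labeled -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(42)
# Two dense groups plus one isolated point
pts = np.vstack([
    rng.normal(0, 0.3, (30, 2)),
    rng.normal(5, 0.3, (30, 2)),
    [[50.0, 50.0]],
])

db = DBSCAN(eps=1.0, min_samples=5).fit(pts)
print(sorted(set(db.labels_)))  # [-1, 0, 1]: two clusters plus the noise label
```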
In order to find a good value for epsilon, we'll first get an idea of how far the samples are from each other. We can get the distance between each instance and its nearest neighbor by using the NearestNeighbors class. This is an unsupervised algorithm that computes and returns, for each instance, the distance to its closest neighbor and the index of that neighbor. We'll then sort and plot these distances.
### Reference: https://towardsdatascience.com/machine-learning-clustering-dbscan-determine-the-optimal-value-for-epsilon-eps-python-example-3100091cfbc#
# n_neighbors includes the instance itself
neigh = NearestNeighbors(n_neighbors=2)
nbrs = neigh.fit(X)
distances, indices = nbrs.kneighbors(X)
# 0th column is all 0s: distance between each instance and itself
# 1st column is the distance between each instance and its closest neighbor
# Here is a (sorted) list of those distances for 20 instances
distances = np.sort(distances[:,1], axis=0)
distances[:20]
array([2.09486124, 2.09486124, 4.35889894, 4.35889894, 4.47940371,
4.47940371, 4.51535642, 5.07563736, 5.07563736, 5.2824968 ,
5.2824968 , 5.31696127, 5.31696127, 5.3263842 , 5.3263842 ,
5.61147354, 5.61147354, 5.83652787, 5.83652787, 5.85864328])
plt.figure(figsize=(8,6))
plt.plot(distances, color='c')
plt.xlim(0)
plt.xlabel('All instances, sorted by distance to their closest neighbor')
plt.ylabel('Distance to closest neighbor')
plt.title('Distances as computed by NearestNeighbors', fontsize=12, fontweight='bold')
plt.show();
We can see that most (almost all) of the instances are at a distance of 20 or less from their nearest neighbor. We can safely say that by setting epsilon somewhere between 16 and 21, DBSCAN will find high-density regions and discard instances that are far from any dense area (anomalies). I will choose 18 for now and tweak later if needed.
### Reference: https://medium.com/@mohantysandip/a-step-by-step-approach-to-solve-dbscan-algorithms-by-tuning-its-hyper-parameters-93e693a91289
dbscan = DBSCAN(eps=18, min_samples=5)
dbscan.fit(X)
DBSCAN(eps=18)
pd.Series(dbscan.labels_).value_counts()
 0    568
-1     72
dtype: int64
By setting min_samples to 5, we obtained one cluster ('0') with 568 instances and 72 outliers, which were assigned -1. Although we didn't get multiple clusters, this could still be interesting information. Looking at the table below, we can see that many of these outliers had a large spread of values across the different variables. There could be other reasons why the algorithm considers these instances anomalous, but that is subject to further analysis.
outliers_df = merged_df[dbscan.labels_ == -1]
outliers_df
| | Ambitious_Instruction | Collaborative_Teachers | Effective_Leaders | Involved_Families | Supportive_Environment | income |
|---|---|---|---|---|---|---|
| 5 | 68.0 | 77.0 | 74.0 | 92.0 | 66.0 | 100.000000 |
| 11 | 43.0 | 28.0 | 25.0 | 51.0 | 22.0 | 49.957311 |
| 18 | 62.0 | 16.0 | 30.0 | 51.0 | 46.0 | 49.957311 |
| 34 | 70.0 | 49.0 | 57.0 | 79.0 | 70.0 | 79.928924 |
| 53 | 85.0 | 54.0 | 57.0 | 84.0 | 54.0 | 79.928924 |
| ... | ... | ... | ... | ... | ... | ... |
| 628 | 41.0 | 82.0 | 72.0 | 24.0 | 44.0 | 44.514525 |
| 630 | 67.0 | 23.0 | 29.0 | 25.0 | 57.0 | 57.996628 |
| 631 | 47.0 | 45.0 | 48.0 | 18.0 | 47.0 | 26.970609 |
| 633 | 57.0 | 18.0 | 7.0 | 33.5 | 49.0 | 43.874197 |
| 636 | 62.0 | 48.0 | 45.0 | 32.0 | 64.0 | 47.394933 |
72 rows × 6 columns
scatter_labels('DBSCAN', dbscan.labels_ + 1, title=True)
From Hands-On Machine Learning with Scikit-Learn, Keras & Tensorflow by Aurélien Géron: A hierarchy of clusters is built from the bottom up. Think of many tiny bubbles floating on water and gradually attaching to each other until there's one big group of bubbles.
A nice aspect of this algorithm is that we can visualize how it works by plotting a dendrogram. As shown below, this lets us visualize the sequential pairing, first of individual instances, then of combinations of instances, until there are only a few clusters and finally just one. We can then decide how many clusters we want to end up with and compare the results with the silhouette diagrams.
X = get_fresh_X()
mergings = linkage(X, 'ward')
dendrogram(mergings, no_labels=True)
plt.show();
Looking at the dendrogram, it seems like n_clusters should be 2, 3 or 4.
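SciPy can also cut the hierarchy directly into a chosen number of flat clusters with fcluster. A self-contained sketch on toy data (with our dataset we would pass the mergings matrix instead):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(42)
# Three clearly separated groups
pts = np.vstack([rng.normal(c, 0.3, (10, 2)) for c in (0, 5, 10)])

Z = linkage(pts, 'ward')
# Cut the dendrogram so that at most 3 flat clusters remain
flat = fcluster(Z, t=3, criterion='maxclust')
print(sorted(set(flat)))  # [1, 2, 3]: fcluster numbers clusters from 1
```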
range_n_clusters = range(2,5)
indices = np.array(range_n_clusters) - np.array(range_n_clusters)[0]
fig, axs = plt.subplots(1, 3, figsize=(14, 6))
axs = axs.flatten()
fig.suptitle("Silhouette analysis for Agglomerative Clustering",
fontsize=20, fontweight='bold')
fig.text(0.5, -0.004, "Silhouette coefficient values", ha="center", va="center", size=16)
fig.text(-0.005, 0.5, "Cluster label", ha="center", va="center", rotation=90, size=16)
for idx, n in zip(indices, range_n_clusters):
model = AgglomerativeClustering(n_clusters=n)
labels = model.fit_predict(X)
ax = axs[idx]
plot_silhouettes(X, n, labels, ax=ax, metric='sqeuclidean', low_x=-0.5)
plt.tight_layout()
plt.show()
plot_params(AgglomerativeClustering, 'n_clusters', [2, 3, 4])
Just as with DBSCAN, here we consider the neighborhood of each instance: Mean Shift places a circle of radius bandwidth on each instance, computes the mean of the instances that fall inside it, and shifts the circle toward that mean. This repeats until the circles stop moving, and instances whose circles settle in the same place end up in the same cluster.
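A toy illustration (the blob layout and quantile are arbitrary choices for this synthetic data): with a bandwidth estimated from the data, Mean Shift recovers the two groups without being told how many there are.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

rng = np.random.default_rng(42)
# Two compact groups, far apart
pts = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(8, 0.5, (30, 2))])

# estimate_bandwidth derives a radius from nearest-neighbor distances
bw = estimate_bandwidth(pts, quantile=0.5)
ms = MeanShift(bandwidth=bw).fit(pts)
print(len(ms.cluster_centers_))  # 2
```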
X = get_fresh_X()
# sklearn's estimate_bandwidth gives us a good starting point for trying out different values
bandwidth = estimate_bandwidth(X, quantile=0.1, random_state=random_state)
bandwidth
29.677654451771637
bw_range = np.arange(10, 60, 2)
### Training can take a long time, so the result is hardcoded below
# ms_cluster_counts = get_cluster_counts(MeanShift, 'bandwidth', bw_range)
ms_cluster_counts = [495, 356, 229, 132, 74, 43, 24, 14, 5, 4, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
plt.plot(bw_range, ms_cluster_counts, color='c')
plt.xlabel('Bandwidth')
plt.ylabel('Cluster count')
plt.title('Mean Shift: how different bandwidths affect cluster count', fontsize=12, fontweight='bold')
plt.show()
bws = get_final_params(initial_param_values=bw_range,
cluster_counts=ms_cluster_counts)
parameter value: 30 --- number of clusters: 3
parameter value: 28 --- number of clusters: 4
parameter value: 26 --- number of clusters: 5
fig, axs = plt.subplots(1, 3, figsize=(12, 6))
axs = axs.flatten()
fig.suptitle("Silhouette analysis for Mean Shift",
fontsize=20, fontweight='bold')
fig.text(0.5, -0.004, "Silhouette coefficient values", ha="center", va="center", size=16)
fig.text(-0.005, 0.5, "Cluster label", ha="center", va="center", rotation=90, size=16)
for idx, bw in enumerate(bws):
model = MeanShift(bandwidth=bw)
labels = model.fit_predict(X)
ax = axs[idx]
plot_silhouettes(X, len(set(labels)), labels, ax=ax, metric='sqeuclidean', low_x=-0.5)
plt.tight_layout()
plt.show()
plot_params(MeanShift, 'bandwidth', bws)
The description from the hdbscan library documentation is succinct: Affinity Propagation is a newer clustering algorithm that uses a graph based approach to let points ‘vote’ on their preferred ‘exemplar’. The end result is a set of cluster ‘exemplars’ from which we derive clusters by essentially doing what K-Means does and assigning each point to the cluster of it’s nearest exemplar.
As with KMeans, this algorithm expects clusters to be globular. A big plus is that it automatically determines the number k of clusters for us. A drawback is that it is very slow.
There are not many hyper-parameters to fine-tune. We'll try different values for preference, which determines how much each instance will "want" to become an exemplar or representative. The higher the value, the more clusters we'll get.
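A small sketch of the effect of preference on toy data (the exact values are arbitrary, and damping is raised to stabilize convergence, as we also do below): a very negative preference discourages instances from becoming exemplars, so fewer clusters tend to form.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(42)
# Three separated groups
pts = np.vstack([rng.normal(c, 0.5, (20, 2)) for c in (0, 5, 10)])

# Very negative preference -> few exemplars; milder preference -> more
few = AffinityPropagation(preference=-5000, damping=0.9, random_state=42).fit(pts)
more = AffinityPropagation(preference=-50, damping=0.9, random_state=42).fit(pts)
print(len(few.cluster_centers_indices_), len(more.cluster_centers_indices_))
```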
X = get_fresh_X()
affinity = AffinityPropagation(random_state=random_state).fit(X)
prefs_range = np.arange(-1000000, 0, 10000)
### Training can take a long time, so the result is hardcoded below
# ap_cluster_counts = get_cluster_counts(AffinityPropagation,
# 'preference',
# np.arange(-1000000, 0, 10000),
# {'damping': 0.9, 'random_state': random_state})
ap_cluster_counts = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 4, 5, 5, 9, 11]
prefs = get_final_params(prefs_range, ap_cluster_counts)
parameter value: -330000 --- number of clusters: 2
parameter value: -170000 --- number of clusters: 3
parameter value: -50000 --- number of clusters: 4
parameter value: -40000 --- number of clusters: 5
parameter value: -20000 --- number of clusters: 9
fig, axs = plt.subplots(2, 3, figsize=(12, 8))
axs = axs.flatten()
fig.suptitle("Silhouette analysis for Affinity Propagation",
fontsize=20, fontweight='bold')
fig.text(0.5, -0.004, "Silhouette coefficient values", ha="center", va="center", size=16)
fig.text(-0.005, 0.5, "Cluster label", ha="center", va="center", rotation=90, size=16)
additional_params = {'damping': 0.9,
'random_state': random_state}
for idx, pref in enumerate(prefs):
model = AffinityPropagation(preference=pref, **additional_params)
labels = model.fit_predict(X)
ax = axs[idx]
plot_silhouettes(X, len(set(labels)), labels, ax=ax, metric='sqeuclidean', low_x=-0.5)
fig.delaxes(axs[-1])
plt.tight_layout()
plt.show()
plot_params(AffinityPropagation, 'preference', prefs, additional_params, figsize=(16,12))
A Gaussian Mixture Model (GMM) starts off with the assumption that the data was produced from a number of multi-variate Gaussian (Normal) distributions. It then tries to find the parameters of those distributions, namely the means and covariance matrices. The latter give the distributions their variance along each dimension as well as their orientation, given by the covariances. Another parameter is a weight assigned to each distribution.
The algorithm for finding the distributions is called Expectation-Maximization (EM). After telling the model how many clusters we'd like, EM alternates between two basic steps: an Expectation step, in which each point is assigned a responsibility (probability of membership) for each distribution given the current parameters, and a Maximization step, in which the means, covariance matrices and weights are re-estimated from those responsibilities.
In sklearn, covariance_type can be set to 'full' (the default), which allows the clusters to take any shape, size and orientation. This makes GMMs extremely flexible.
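A small sketch (on synthetic data, not this notebook's dataset) of what covariance_type changes: with 'full', sklearn learns one complete covariance matrix per component, while 'spherical' learns just a single variance per component.

```python
# Sketch: how covariance_type changes the learned covariance shapes.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(42)
X_demo = np.vstack([rng.normal(0, 1, (100, 2)),
                    rng.normal(5, 1, (100, 2))])  # two synthetic blobs

gm_full = GaussianMixture(n_components=2, covariance_type='full',
                          random_state=42).fit(X_demo)
gm_sph = GaussianMixture(n_components=2, covariance_type='spherical',
                         random_state=42).fit(X_demo)

print(gm_full.covariances_.shape)  # (2, 2, 2): a full matrix per component
print(gm_sph.covariances_.shape)   # (2,): one variance per component
```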
range_n_components = np.arange(2, 6)
fig, axs = plt.subplots(1, 4, figsize=(16, 8))
axs = axs.flatten()
fig.suptitle("Silhouette analysis for Gaussian Mixture",
fontsize=20, fontweight='bold')
fig.text(0.5, -0.004, "Silhouette coefficient values", ha="center", va="center", size=16)
fig.text(-0.005, 0.5, "Cluster label", ha="center", va="center", rotation=90, size=16)
additional_params = {'random_state': random_state,
'tol': 0.1,
'max_iter': 500}
for idx, n in enumerate(range_n_components):
    model = GaussianMixture(n_components=n, **additional_params)
    labels = model.fit_predict(X)
    ax = axs[idx]
    plot_silhouettes(X, n, labels, ax=ax, metric='sqeuclidean', low_x=-0.5)
plt.tight_layout()
plt.show()
plot_params(GaussianMixture, 'n_components', range_n_components, additional_params)
In this extension of GMMs, the cluster parameters (means, covariance matrices and weights) are themselves treated as latent random variables drawn from prior distributions, and the algorithm tries to infer the parameters of those priors. A useful side effect is that it can drive the weights of superfluous components toward zero.
A drawback of this algorithm is that, like the plain GMM, it prefers ellipsoidal cluster shapes, so it won't work well with other data configurations.
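A quick sketch of the component-pruning behavior (synthetic data with two well-separated blobs, not this notebook's dataset): even when we ask for more components than the data supports, the Bayesian variant concentrates most of the weight on the components it actually needs.

```python
# Sketch: BayesianGaussianMixture can shrink the weights of unneeded
# components, effectively choosing the cluster count itself.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.RandomState(42)
X_demo = np.vstack([rng.normal(0, 1, (150, 2)),
                    rng.normal(6, 1, (150, 2))])  # 2 true clusters

bgm_demo = BayesianGaussianMixture(n_components=6, max_iter=500,
                                   random_state=42).fit(X_demo)

# Weights sum to 1; most of the mass should land on ~2 components
print(np.round(bgm_demo.weights_, 3))
```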
bgm = BayesianGaussianMixture(n_components=5, n_init=50, max_iter=1000, random_state=random_state).fit(X)
labels = bgm.predict(X)
centers = bgm.means_
n_clusters = len(np.unique(labels))
plot_silhouettes(X, n_clusters, labels, title="Silhouette analysis for Bayesian Gaussian Mixture",
                 metric='sqeuclidean', low_x=-1, ax=None, figsize=(6, 5))
fig, axs = plt.subplots(1, 3, figsize=(12,3))
fig.suptitle('Bayesian Gaussian Mixture')
plt.tight_layout()
for i in [0, 1, 2]:
    x = i * 2
    y = (i * 2) + 1
    scatter_labels(bgm, labels, centers, x=x, y=y, ax=axs[i])
From looking at the silhouette scores and diagrams, these seem to be the best models for the task:
- KMeans: n_clusters = 2, n_clusters = 3
- AffinityPropagation: preference = -330000 (2 clusters), preference = -170000 (3 clusters)
- GaussianMixture: n_components = 2, n_components = 3
Next, we'll compare these models using silhouette scores and diagrams. After that, we'll visualize the effects of the individual models on the data using 3-D projections.
X = get_fresh_X()
fig, axs = plt.subplots(2, 3, figsize=(12, 8))
axs = axs.flatten()
fig.suptitle("Comparison of most promising models",
fontsize=20, fontweight='bold')
fig.text(0.5, -0.004, "Silhouette coefficient values", ha="center", va="center", size=16)
fig.text(-0.005, 0.5, "Cluster label", ha="center", va="center", rotation=90, size=16)
# KMeans
kmeans_2 = KMeans(n_clusters=2, random_state=random_state)
labels_kmeans_2 = kmeans_2.fit_predict(X)
plot_silhouettes(X, 2, labels_kmeans_2, title='KMeans: n_clusters 2', metric='sqeuclidean', ax=axs[0], low_x=-0.4)
kmeans_3 = KMeans(n_clusters=3, random_state=random_state)
labels_kmeans_3 = kmeans_3.fit_predict(X)
plot_silhouettes(X, 3, labels_kmeans_3, title='KMeans: n_clusters 3', metric='sqeuclidean', ax=axs[3], low_x=-0.4)
# AffinityPropagation
additional_params = {'damping': 0.9,
'random_state': random_state}
ap_33 = AffinityPropagation(preference=-330000, **additional_params)
labels_ap_33 = ap_33.fit_predict(X)
plot_silhouettes(X, 2, labels_ap_33, title='Affinity Propagation: preference -330000', metric='sqeuclidean', ax=axs[1], low_x=-0.4)
ap_17 = AffinityPropagation(preference=-170000, **additional_params)
labels_ap_17 = ap_17.fit_predict(X)
plot_silhouettes(X, 3, labels_ap_17, title='Affinity Propagation: preference -170000', metric='sqeuclidean', ax=axs[4], low_x=-0.4)
# GaussianMixture
additional_params = {'random_state': random_state,
'tol': 0.1,
'max_iter': 500}
gm_2 = GaussianMixture(n_components=2, **additional_params)
labels_gm_2 = gm_2.fit_predict(X)
plot_silhouettes(X, 2, labels_gm_2, title='Gaussian Mixture: n_components 2', metric='sqeuclidean', ax=axs[2], low_x=-0.4)
gm_3 = GaussianMixture(n_components=3, **additional_params)
labels_gm_3 = gm_3.fit_predict(X)
plot_silhouettes(X, 3, labels_gm_3, title='Gaussian Mixture: n_components 3', metric='sqeuclidean', ax=axs[5], low_x=-0.4)
plt.tight_layout()
plt.show()
t-SNE (t-Distributed Stochastic Neighbor Embedding) reduces the data to n_components dimensions, usually 2 or 3 when used for visualization. As we reduce the dataset from its original 6 dimensions to 3, we lose the original units, but we still get a sense of which points are close together and which are far apart. We can then color each point by the label assigned by the model.
### Based on:
# https://towardsdatascience.com/cluster-analysis-create-visualize-and-interpret-customer-segments-474e55d00ebb
# https://github.com/MaartenGr/CustomerSegmentation/blob/master/Customer%20Segmentation.ipynb
def prepare_tsne(n_components, data, labels):
    names = ['x', 'y', 'z']
    matrix = TSNE(n_components=n_components, random_state=random_state).fit_transform(data)
    df_matrix = pd.DataFrame(matrix)
    df_matrix.rename({i: names[i] for i in range(n_components)}, axis=1, inplace=True)
    df_matrix['labels'] = labels
    df_matrix['labels'] = df_matrix['labels'].astype('str')
    return df_matrix
def plot_3d(df, title='', color='labels', range_=[-5, 5]):
    fig = px.scatter_3d(df, x='x', y='y', z='z',
                        color=color, opacity=0.5, title=title,
                        range_x=range_, range_y=range_, range_z=range_,
                        color_continuous_scale=px.colors.diverging.RdBu,
                        width=900, height=600)
    fig.update_traces(marker=dict(size=3))
    fig.show()
tsne_kmeans_2 = prepare_tsne(3, X, labels_kmeans_2)
plot_3d(tsne_kmeans_2, title='KMeans: n_clusters 2', range_=[-20,20])
tsne_kmeans_3 = prepare_tsne(3, X, labels_kmeans_3)
plot_3d(tsne_kmeans_3, title='KMeans: n_clusters 3', range_=[-20,20])
tsne_ap_33 = prepare_tsne(3, X, labels_ap_33)
plot_3d(tsne_ap_33, title='Affinity Propagation: preference -330000', range_=[-20,20])
tsne_ap_17 = prepare_tsne(3, X, labels_ap_17)
plot_3d(tsne_ap_17, title='Affinity Propagation: preference -170000', range_=[-20,20])
tsne_gm_2 = prepare_tsne(3, X, labels_gm_2)
plot_3d(tsne_gm_2, title='Gaussian Mixture: n_components 2', range_=[-20,20])
tsne_gm_3 = prepare_tsne(3, X, labels_gm_3)
plot_3d(tsne_gm_3, title='Gaussian Mixture: n_components 3', range_=[-20,20])
The model that seems to perform best on this dataset happens to be the very first one we tried: KMeans with n_clusters=2.
Let's make one last visualization, once again using the t-SNE 3-D projection, but this time coloring the points by their individual silhouette coefficients. To distinguish between the labels, we can make one of them negative: for any instance assigned label 0, we multiply its silhouette coefficient by -1.
That gives us a range of [-1, 1], so we can use a divergent colormap in which borderline samples appear lighter.
kmeans_2_label_sign = (labels_kmeans_2 - 0.5) * 2
kmeans_2_silhouettes = silhouette_samples(X, labels_kmeans_2, metric='sqeuclidean')
kmeans_2_silhouettes *= kmeans_2_label_sign
tsne_kmeans_2['signed_silhouettes'] = kmeans_2_silhouettes
plot_3d(tsne_kmeans_2, title='KMeans: n_clusters 2', color='signed_silhouettes', range_=[-20,20])
We can visualize how the features vary between the two clusters. As the bar plot below shows, income barely changes from one cluster to the other, while the other five features are roughly 30% higher in cluster 0 than in cluster 1.
# df = load_preprocess_data()
merged_df['kmeans_labels'] = labels_kmeans_2
tidy = merged_df.melt(id_vars='kmeans_labels')
fig, ax = plt.subplots(figsize=(15, 5))
sns.barplot(x='kmeans_labels', y='value', hue='variable', data=tidy, palette='Set3')
We can now answer the original questions:
Do the Chicago Public Schools fall into clusters according to this dataset?
The dataset doesn't have a 'naturally' clustered structure. None of the models achieved a high silhouette score; even the best-performing model scored slightly below 0.5.
If not, is it possible to partition the data in a way that is informative?
Yes. KMeans provided clusters that, when analyzed by silhouette coefficients, gave us insights on some interesting differences.
What information separates the clusters/partitions?
According to the last bar plot, there are differences of roughly 30% for all variables except 'income', which is virtually identical across the two clusters.
How can the results provided by this analysis be used for action?
Clustering is most often used as just one step in an analytical or predictive process. As such, the cluster labels can be used as categorical features or, if the clustering algorithm supports soft clustering, we can use the predicted probabilities as a numerical feature.
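As a purely illustrative sketch of that idea (on toy data, not this notebook's dataset), a soft-clustering model's predicted probabilities can simply be concatenated to the original features:

```python
# Illustrative sketch: using soft-cluster probabilities from a
# GaussianMixture as extra features for a downstream model.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(42)
X_toy = rng.normal(size=(200, 4))

gm_toy = GaussianMixture(n_components=3, random_state=42).fit(X_toy)
hard_labels = gm_toy.predict(X_toy)       # usable as a categorical feature
soft_probs = gm_toy.predict_proba(X_toy)  # usable as numerical features

# Append the 3 membership probabilities to the 4 original columns
X_augmented = np.hstack([X_toy, soft_probs])
print(X_augmented.shape)  # (200, 7)
```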
That being said, I do not believe that these clusters should be used for predictive modeling since, as I mentioned earlier, there doesn't seem to be a clustering structure in the first place. We can still use these results as fodder for further analysis, for example by looking at how the clusters correspond to geographical locations or any other attributes.